# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Tue Dec 5 15:10:01 2023"
I feel fairly comfortable working with RStudio, as it is one of the main tools I use daily. However, after browsing through "R for Health Data Science" I see there are several ways to use R, and I tend to structure my code in a rather different fashion. Perhaps the most striking difference is the use of pipes. The example files use pipes everywhere, whereas I tend to avoid them, as I feel they make the code harder to follow even if they shorten and simplify it. For example, if we create a random data frame with three columns
exampledata1=c(2,4,6,6,8,9,10)
exampledata2=c(70.2,44.0,62.7,106.1,81.1,96.1,10.8)
exampledata3=c(16,23,10,12,9,10,17)
df=data.frame(exampledata1,exampledata2,exampledata3)
print(df)
## exampledata1 exampledata2 exampledata3
## 1 2 70.2 16
## 2 4 44.0 23
## 3 6 62.7 10
## 4 6 106.1 12
## 5 8 81.1 9
## 6 9 96.1 10
## 7 10 10.8 17
and we want to calculate the mean of the 3rd column, I would use the following command:
mean(df$exampledata3)
## [1] 13.85714
whereas the exercise dataset would use the command
df$exampledata3 %>% mean()
## [1] 13.85714
Both ways lead to the same result, even though the syntax is different.
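To be fair, the pipe pays off once several steps are chained. A small sketch of the difference (assuming the magrittr pipe is available, e.g. via dplyr):

```r
# Nested calls read inside-out...
round(mean(df$exampledata3), 1)

# ...while the piped version reads left to right (requires magrittr or dplyr)
library(magrittr)
df$exampledata3 %>% mean() %>% round(1)
## [1] 13.9
```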
Another difference is plotting. The exercise uses ggplot2, whereas I am used to base R plots. I know ggplot2 is a more versatile and widely used tool; however, I find its logic confusing compared to base plots. I hope to learn to use ggplot2 during this course.
Overall, I cannot say how well the exercise set worked as a crash course, as most of it was already familiar to me. However, the parts about ggplot2 were my favourite, as they helped me understand the logic behind the commands. Using Markdown is not familiar to me; I have only used R scripts earlier.
date()
## [1] "Tue Dec 5 15:10:01 2023"
The first step is to read the data created during the data wrangling part. In this case it is stored on my computer.
learningdata=read.csv("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/learning2014.csv")
head(learningdata)
## X gender Age Attitude deep stra surf Points
## 1 1 F 53 37 3.583333 3.375 2.583333 25
## 2 2 M 55 31 2.916667 2.750 3.166667 12
## 3 3 F 49 25 3.500000 3.625 2.250000 24
## 4 4 M 53 35 3.500000 3.125 2.250000 10
## 5 5 M 49 37 3.666667 3.625 2.833333 22
## 6 6 F 38 38 4.750000 3.625 2.416667 21
The data appears to be in good order. After importing the data we can inspect its structure.
str(learningdata)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
There are 8 variables (X, gender, Age, Attitude, deep, stra, surf and Points) related to learning. X is a number used to identify observations; it is not a measurement, just a tool to identify different persons, so it is irrelevant for the data exploration and can be excluded from the graphical investigation. A scatterplot matrix is a useful tool for understanding the relationships between the variables. Gender can be treated as a dummy variable, as it is either male or female in this data. To clarify the visuals, we can use different colours for the two genders.
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
p <- ggpairs(learningdata[-1], mapping = aes(alpha=0.3, col=gender), lower = list(combo = wrap("facethist", bins = 20)))
p
The correlations between variables can be seen in the top-right part of the graph (in total and by gender separately). The bottom-left part of the graph shows scatter plots, and the diagonal in the middle shows the distributions of the observations. The top row contains boxplots of the effect of gender on the different variables, the left column contains histograms of each variable by gender, and the top-left corner shows the counts of the genders. There are several aspects we can observe from this one graph.
To create a model, we should select variables that have a high correlation with the explained variable (Points in this case). However, if two variables have a high correlation with the explained variable but are also highly correlated with each other, they should not both be used. In this case Points is correlated with Attitude. The other attributes have quite low correlations, but stra and surf have some correlation with Points, so we add them to the model as well.
model = lm(Points ~ Attitude + stra +surf, data = learningdata)
summary(model)
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = learningdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
According to the summary table, surf has an extraordinarily large p-value, and therefore the model can be simplified.
model2 = lm(Points ~ Attitude + stra, data = learningdata)
summary(model2)
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = learningdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
If we drop the surf variable from the model, we get a simpler model with a higher adjusted R-squared and a smaller p-value. The instructions of this exercise advise that non-significant terms should be removed. However, the question of significance is not black and white; it is more about how much uncertainty we are ready to accept. In model2, the variable stra has a p-value of 0.089. If we use a 95% confidence level, the term is not significant (p > 0.05), but if we use 90% confidence, it is (p < 0.1). Luckily we have more tools to decide whether or not we wish to simplify the model further.
model3 = lm(Points ~ Attitude, data = learningdata)
summary(model3)
##
## Call:
## lm(formula = Points ~ Attitude, data = learningdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
If we simplify the model even further, we start to lose some of its validity, as the adjusted R-squared starts to decrease. Therefore we might not want to keep only one explanatory variable, and we are happy with model2.
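One way to see this at a glance is to extract the adjusted \(R^2\) values from the model summaries directly (a small sketch, assuming the three model objects fitted above are still in the workspace):

```r
# Adjusted R-squared of each candidate model, extracted programmatically
sapply(list(model = model, model2 = model2, model3 = model3),
       function(fit) summary(fit)$adj.r.squared)
##  model model2 model3
## 0.1927 0.1951 0.1856
```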
We can further compare the models using the Bayesian Information Criterion (BIC).
library(flexmix)
## Warning: package 'flexmix' was built under R version 4.2.3
## Loading required package: lattice
BIC(model)
## [1] 1046.051
BIC(model2)
## [1] 1041.486
BIC(model3)
## [1] 1039.323
According to BIC (smaller is better), the model with only Attitude as a predictor is the best, despite its lower R-squared.
We can use yet more tools to compare the different models. The command `step` drops predictors one by one until the smallest AIC value has been reached and further simplification would only increase the AIC.
step(model)
## Start: AIC=557.4
## Points ~ Attitude + stra + surf
##
## Df Sum of Sq RSS AIC
## - surf 1 15.00 4559.4 555.95
## <none> 4544.4 557.40
## - stra 1 69.61 4614.0 557.93
## - Attitude 1 980.95 5525.3 587.85
##
## Step: AIC=555.95
## Points ~ Attitude + stra
##
## Df Sum of Sq RSS AIC
## <none> 4559.4 555.95
## - stra 1 81.74 4641.1 556.90
## - Attitude 1 1051.89 5611.3 588.41
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = learningdata)
##
## Coefficients:
## (Intercept) Attitude stra
## 8.9729 0.3466 0.9137
The smallest AIC score is achieved with Attitude and stra as predictors (= model2), suggesting it is the best model. It could be argued that model2 and model3 are both valid, but I would prefer model2, so let's take a closer look at it.
summary(model2)
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = learningdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
According to the summary table, the model intercept is 8.97, the parameter for Attitude is 0.35 and for stra 0.91. So the formula of the model is \(Points = 8.97 + 0.35 \times Attitude + 0.91 \times stra\). This means that a one-unit increase in Attitude increases Points by 0.35, and a one-unit increase in stra increases Points by 0.91. If both Attitude and stra are 0, the model suggests that a test subject receives about 9 points.
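As a sanity check of the formula, we can ask the model for a prediction for a hypothetical student; the values Attitude = 30 and stra = 3 below are made up for illustration:

```r
# Predict Points for one hypothetical student with model2
newstudent <- data.frame(Attitude = 30, stra = 3)
predict(model2, newdata = newstudent)
# the same by hand: 8.97290 + 0.34658 * 30 + 0.91365 * 3 = 22.11
```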
Multiple R-squared describes how much of the variation in the dependent variable is explained by the model variables: 1 would mean all of the variation is explained and 0 would mean none of it is. The multiple \(R^2\) of this model is 0.20, suggesting that 20 % of the variation is explained by the independent variables. In addition, the summary table provides an adjusted \(R^2\). This adjustment makes it easier to compare models with different numbers of predictors, as a higher number of predictors tends to increase \(R^2\) even when the added predictors do not improve the quality of the model, since each predictor explains some percentage of the variance, even by coincidence.
plot(model2, which = 1)
The residuals vs fitted values plot is used to check for non-linearity. Residuals describe the distance between an observation and the model. If the residuals have an increasing or decreasing trend, we have issues with the model, i.e. we should have used nonlinear regression instead of linear. We can also detect outliers using this plot. Here the average of the residuals is close to zero throughout the graph and there are no clear patterns, so all is good. However, we can see some outliers in the data, illustrated in the lower middle part of the graph.
plot(model2, which = 2)
The normal Q-Q plot sorts the observations by residual into ascending order. The majority of observations are where one would expect them to be in normally distributed data; only the tails are a bit off from the expected values. However, as we have a limited sample size, I would not be too worried about that. As a conclusion: the residuals are normally distributed, or close to it.
plot(model2, which = 5)
Residuals vs leverage is used to find influential observations. High leverage means that changes in that observation would affect the model more than changes in observations with low leverage. In this case there seem to be no highly influential observations, as all of them are within Cook's distance (we cannot even see the threshold at this zoom level). We can also see that the residuals do not really change as a function of leverage, indicating yet again normality and that the variances of the observations show no patterns.
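For completeness, base R can draw all four default diagnostic plots in one panel instead of picking them one at a time with `which` (a minor convenience, not part of the exercise):

```r
par(mfrow = c(2, 2))   # 2 x 2 plotting grid
plot(model2)           # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))   # restore the default layout
```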
Disclaimer! I have had a super busy week, and unfortunately I did not have enough time to really focus; I am coding this last minute before the deadline. So if and when I have made typos and silly mistakes, please be patient. I am sorry.
date()
## [1] "Tue Dec 5 15:10:11 2023"
The first step is to read the data created during the data wrangling part. In this case it is stored on my computer.
alc=read.csv("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/alcdata.csv")
head(alc)
## X school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 2 GP F 17 U GT3 T 1 1 at_home other
## 3 3 GP F 15 U LE3 T 1 1 at_home other
## 4 4 GP F 15 U GT3 T 4 2 health services
## 5 5 GP F 16 U GT3 T 3 3 other other
## 6 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime schoolsup famsup activities nursery
## 1 course mother 2 2 yes no no yes
## 2 course father 1 2 no yes no no
## 3 other mother 1 2 yes no no yes
## 4 home mother 1 3 no yes yes yes
## 5 home father 1 2 no yes no yes
## 6 reputation mother 1 2 no yes yes yes
## higher internet romantic famrel freetime goout Dalc Walc health failures paid
## 1 yes no no 4 3 4 1 1 3 0 no
## 2 yes yes no 5 3 3 1 1 3 0 no
## 3 yes yes no 4 3 2 2 3 3 2 yes
## 4 yes yes yes 3 2 2 1 1 5 0 yes
## 5 yes no no 4 3 2 1 2 5 0 yes
## 6 yes yes no 5 4 2 1 2 5 0 yes
## absences G1 G2 G3 alc_use high_use
## 1 5 2 8 8 1.0 FALSE
## 2 3 7 8 8 1.0 FALSE
## 3 8 10 10 11 2.5 TRUE
## 4 1 14 14 14 1.0 FALSE
## 5 2 8 12 12 1.5 FALSE
## 6 8 14 14 14 1.5 FALSE
Let's remove the first column "X", as it is a duplicate of the row numbers.
alc=alc[,-1]
Let's take a look at our data.
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
We have 35 columns describing information about students. Some variables describe background information. Variables G1, G2 and G3 describe the grades of the test subjects, alc_use tells how large the alcohol consumption is, and high_use tells whether the consumption is high or not.
The four variables I would like to choose are absences, G3, Pstatus and age. I assume that high absences, low grades, a broken family and older age are linked to high consumption of alcohol.
First I would like to check how those variables are related to each other. But before that, let's load some packages!
library(tidyr); library(dplyr); library(ggplot2)
## Warning: package 'tidyr' was built under R version 4.2.3
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
table(alc$Pstatus, alc$high_use)
##
## FALSE TRUE
## A 26 12
## T 233 99
g1 <- ggplot(alc, aes(x = high_use, y = age))
g2 <- ggplot(alc, aes(x = high_use, y = G3))
g3 <- ggplot(alc, aes(x = high_use, y = absences))
g1 + geom_boxplot()
g2 + geom_boxplot()
g3 + geom_boxplot()
Based on the table and the boxplots, the differences appear to go in the hypothesised directions, so the hypotheses presented seem to be correct.
m <- glm(high_use ~ G3 + absences + age + Pstatus, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ G3 + absences + age + Pstatus, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3352 -0.8239 -0.7050 1.2037 1.8781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.52154 1.82438 -1.382 0.166931
## G3 -0.06845 0.03602 -1.900 0.057371 .
## absences 0.08199 0.02335 3.511 0.000446 ***
## age 0.12084 0.10147 1.191 0.233666
## PstatusT 0.05517 0.39732 0.139 0.889572
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 428.47 on 365 degrees of freedom
## AIC: 438.47
##
## Number of Fisher Scoring iterations: 4
According to the summary of the created model, only G3 and absences are statistically significant; family status has very little to do with alcohol consumption. A simpler model would be better. For that we can use the `step` command, which finds the model with the smallest (= best) AIC score.
step(m)
## Start: AIC=438.47
## high_use ~ G3 + absences + age + Pstatus
##
## Df Deviance AIC
## - Pstatus 1 428.49 436.49
## - age 1 429.89 437.89
## <none> 428.47 438.47
## - G3 1 432.09 440.09
## - absences 1 442.85 450.85
##
## Step: AIC=436.49
## high_use ~ G3 + absences + age
##
## Df Deviance AIC
## - age 1 429.93 435.93
## <none> 428.49 436.49
## - G3 1 432.17 438.17
## - absences 1 443.07 449.07
##
## Step: AIC=435.93
## high_use ~ G3 + absences
##
## Df Deviance AIC
## <none> 429.93 435.93
## - G3 1 434.52 438.52
## - absences 1 445.68 449.68
##
## Call: glm(formula = high_use ~ G3 + absences, family = "binomial",
## data = alc)
##
## Coefficients:
## (Intercept) G3 absences
## -0.38732 -0.07606 0.08423
##
## Degrees of Freedom: 369 Total (i.e. Null); 367 Residual
## Null Deviance: 452
## Residual Deviance: 429.9 AIC: 435.9
In this case such a model has just two predictors: absences and G3. So let's create that model.
m <- glm(high_use ~ G3 + absences, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ G3 + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3286 -0.8298 -0.7219 1.2113 1.9242
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.38732 0.43500 -0.890 0.373259
## G3 -0.07606 0.03553 -2.141 0.032311 *
## absences 0.08423 0.02309 3.648 0.000264 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 429.93 on 367 degrees of freedom
## AIC: 435.93
##
## Number of Fisher Scoring iterations: 4
So according to the model, the estimate for the parameter G3 is -0.08, the parameter for absences is 0.084, and the intercept is -0.39. To calculate odds ratios, we use the following command:
OR <- coef(m) %>% exp
OR
## (Intercept) G3 absences
## 0.6788751 0.9267616 1.0878762
The odds ratios describe the multiplicative change in the odds of high alcohol use when the corresponding explanatory variable increases by one unit.
CI=confint(m)
## Waiting for profiling to be done...
CI
## 2.5 % 97.5 %
## (Intercept) -1.25260555 0.459482521
## G3 -0.14621889 -0.006471056
## absences 0.04110528 0.131708308
This provides 95 % confidence intervals for the coefficients, 97.5 % being the upper and 2.5 % the lower bound.
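Since these intervals are on the log-odds scale, exponentiating their endpoints gives confidence intervals for the odds ratios themselves; combining them with the point estimates makes the table easier to read in one place:

```r
# Odds ratios and their 95 % confidence intervals on the same (odds) scale
cbind(OR = exp(coef(m)), exp(confint(m)))
```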
Hypothesis-wise this means that age and family status do not play an important role in whether one consumes high amounts of alcohol or not, and that high grades are associated with less alcohol use while high absences are tied to high alcohol consumption. However, it is unclear which one is the cause and which is the result.
alc = mutate(alc, probability = predict(m, type = "response"))
alc = mutate(alc, prediction =probability>0.5)
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table()
## prediction
## high_use FALSE TRUE
## FALSE 0.68108108 0.01891892
## TRUE 0.25675676 0.04324324
The model was good at predicting when high use is false. However, it often missed actual high use: many true high users were predicted to be non-users.
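The same cross-tabulation can be summarised into a single accuracy figure (a sketch, assuming the `prediction` column created above):

```r
# Confusion table as raw counts, plus the overall share of correct predictions
tab <- table(high_use = alc$high_use, prediction = alc$prediction)
tab
sum(diag(tab)) / sum(tab)   # about 0.72 of the predictions are correct
```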
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))
# define the geom as points and draw the plot
g+geom_point()
The visual representation shows that the model often predicts high use to be false even though it actually is true.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(alc$high_use,alc$probability)
## [1] 0.2756757
The training error (the mean proportion of wrong predictions) is 0.28.
library(boot)
## Warning: package 'boot' was built under R version 4.2.3
##
## Attaching package: 'boot'
## The following object is masked from 'package:flexmix':
##
## boot
## The following object is masked from 'package:lattice':
##
## melanoma
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2864865
The average proportion of wrong predictions in the cross-validation is about 0.286 (the exact value varies between runs, as the folds are random), which is greater than the error in the exercise set (0.26). This means we have a worse model than in Exercise 3. Let's try to find a better model.
m2 <- glm(high_use ~ G3 + absences+ school + sex + address+ famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + schoolsup+ famsup+ activities + nursery + higher + internet + romantic + famrel + freetime + goout + health + failures + paid + absences + G1 + G2, data = alc, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ G3 + absences + school + sex + address +
## famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason +
## guardian + traveltime + studytime + schoolsup + famsup +
## activities + nursery + higher + internet + romantic + famrel +
## freetime + goout + health + failures + paid + absences +
## G1 + G2, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7620 -0.6544 -0.3590 0.4831 2.8990
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.10711 1.79919 -1.727 0.084177 .
## G3 0.05976 0.11463 0.521 0.602121
## absences 0.09105 0.02706 3.365 0.000766 ***
## schoolMS -0.48058 0.53496 -0.898 0.369001
## sexM 0.84885 0.33281 2.551 0.010755 *
## addressU -0.85479 0.41943 -2.038 0.041552 *
## famsizeLE3 0.51405 0.33078 1.554 0.120176
## PstatusT -0.29228 0.53799 -0.543 0.586937
## Medu 0.04021 0.22256 0.181 0.856645
## Fedu 0.14434 0.19799 0.729 0.465991
## Mjobhealth -1.07147 0.76687 -1.397 0.162353
## Mjobother -0.85282 0.50337 -1.694 0.090222 .
## Mjobservices -0.82754 0.58036 -1.426 0.153896
## Mjobteacher -0.34813 0.70825 -0.492 0.623052
## Fjobhealth 0.33260 1.19154 0.279 0.780142
## Fjobother 0.81603 0.86311 0.945 0.344431
## Fjobservices 1.19959 0.88620 1.354 0.175853
## Fjobteacher -0.34774 1.04522 -0.333 0.739367
## reasonhome 0.55292 0.39470 1.401 0.161262
## reasonother 1.52294 0.54867 2.776 0.005508 **
## reasonreputation 0.01215 0.41787 0.029 0.976799
## guardianmother -0.83744 0.37652 -2.224 0.026136 *
## guardianother -0.13975 0.80413 -0.174 0.862026
## traveltime 0.32446 0.24022 1.351 0.176797
## studytime -0.27488 0.20833 -1.319 0.187012
## schoolsupyes 0.03707 0.47710 0.078 0.938071
## famsupyes -0.07901 0.33491 -0.236 0.813504
## activitiesyes -0.58242 0.31635 -1.841 0.065608 .
## nurseryyes -0.48277 0.36927 -1.307 0.191090
## higheryes -0.30306 0.73540 -0.412 0.680263
## internetyes 0.09333 0.45868 0.203 0.838771
## romanticyes -0.44975 0.33749 -1.333 0.182662
## famrel -0.57891 0.17076 -3.390 0.000698 ***
## freetime 0.29873 0.17003 1.757 0.078928 .
## goout 0.90284 0.15767 5.726 1.03e-08 ***
## health 0.18851 0.11495 1.640 0.101042
## failures 0.33167 0.27916 1.188 0.234790
## paidyes 0.53225 0.32442 1.641 0.100883
## G1 -0.13568 0.12864 -1.055 0.291563
## G2 0.10089 0.15969 0.632 0.527532
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 310.30 on 330 degrees of freedom
## AIC: 390.3
##
## Number of Fisher Scoring iterations: 5
step(m2)
## Start: AIC=390.3
## high_use ~ G3 + absences + school + sex + address + famsize +
## Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian +
## traveltime + studytime + schoolsup + famsup + activities +
## nursery + higher + internet + romantic + famrel + freetime +
## goout + health + failures + paid + absences + G1 + G2
##
## Df Deviance AIC
## - Mjob 4 314.58 386.58
## - schoolsup 1 310.30 388.30
## - Medu 1 310.33 388.33
## - internet 1 310.34 388.34
## - famsup 1 310.35 388.35
## - higher 1 310.47 388.47
## - G3 1 310.57 388.57
## - Pstatus 1 310.59 388.59
## - G2 1 310.70 388.70
## - Fedu 1 310.83 388.83
## - Fjob 4 317.06 389.06
## - school 1 311.12 389.12
## - G1 1 311.40 389.40
## - failures 1 311.73 389.73
## - nursery 1 311.99 389.99
## - studytime 1 312.08 390.08
## - romantic 1 312.11 390.11
## - traveltime 1 312.14 390.14
## <none> 310.30 390.30
## - famsize 1 312.70 390.70
## - paid 1 313.03 391.03
## - health 1 313.05 391.05
## - freetime 1 313.44 391.44
## - guardian 2 315.67 391.67
## - activities 1 313.75 391.75
## - address 1 314.45 392.45
## - reason 3 319.54 393.54
## - sex 1 316.96 394.96
## - absences 1 321.59 399.59
## - famrel 1 322.20 400.20
## - goout 1 350.56 428.56
##
## Step: AIC=386.58
## high_use ~ G3 + absences + school + sex + address + famsize +
## Pstatus + Medu + Fedu + Fjob + reason + guardian + traveltime +
## studytime + schoolsup + famsup + activities + nursery + higher +
## internet + romantic + famrel + freetime + goout + health +
## failures + paid + G1 + G2
##
## Df Deviance AIC
## - Fjob 4 320.37 384.37
## - Medu 1 314.58 384.58
## - schoolsup 1 314.59 384.59
## - internet 1 314.63 384.63
## - Pstatus 1 314.67 384.67
## - famsup 1 314.72 384.72
## - higher 1 314.82 384.82
## - G2 1 314.84 384.84
## - G3 1 314.92 384.92
## - school 1 315.37 385.37
## - G1 1 315.71 385.71
## - studytime 1 315.74 385.74
## - Fedu 1 315.77 385.77
## - failures 1 316.08 386.08
## - romantic 1 316.33 386.33
## - health 1 316.39 386.39
## - nursery 1 316.44 386.44
## - traveltime 1 316.50 386.50
## <none> 314.58 386.58
## - guardian 2 318.64 386.64
## - famsize 1 317.00 387.00
## - freetime 1 317.40 387.40
## - activities 1 317.68 387.68
## - paid 1 318.01 388.01
## - reason 3 322.94 388.94
## - address 1 319.29 389.29
## - sex 1 321.42 391.42
## - famrel 1 326.22 396.22
## - absences 1 326.30 396.30
## - goout 1 353.06 423.06
##
## Step: AIC=384.37
## high_use ~ G3 + absences + school + sex + address + famsize +
## Pstatus + Medu + Fedu + reason + guardian + traveltime +
## studytime + schoolsup + famsup + activities + nursery + higher +
## internet + romantic + famrel + freetime + goout + health +
## failures + paid + G1 + G2
##
## Df Deviance AIC
## - Pstatus 1 320.39 382.39
## - internet 1 320.40 382.40
## - schoolsup 1 320.41 382.41
## - Medu 1 320.44 382.44
## - higher 1 320.51 382.51
## - G3 1 320.61 382.61
## - famsup 1 320.72 382.72
## - G2 1 320.84 382.84
## - Fedu 1 320.90 382.90
## - school 1 321.08 383.08
## - studytime 1 321.52 383.52
## - traveltime 1 321.79 383.79
## - failures 1 321.83 383.83
## - nursery 1 322.04 384.04
## - G1 1 322.23 384.23
## - romantic 1 322.23 384.23
## <none> 320.37 384.37
## - health 1 322.38 384.38
## - freetime 1 322.53 384.53
## - famsize 1 322.93 384.93
## - activities 1 323.16 385.16
## - guardian 2 325.32 385.32
## - reason 3 327.69 385.69
## - paid 1 324.81 386.81
## - address 1 325.55 387.55
## - sex 1 327.22 389.22
## - famrel 1 330.47 392.47
## - absences 1 332.72 394.72
## - goout 1 360.29 422.29
##
## Step: AIC=382.39
## high_use ~ G3 + absences + school + sex + address + famsize +
## Medu + Fedu + reason + guardian + traveltime + studytime +
## schoolsup + famsup + activities + nursery + higher + internet +
## romantic + famrel + freetime + goout + health + failures +
## paid + G1 + G2
##
## Df Deviance AIC
## - internet 1 320.42 380.42
## - schoolsup 1 320.43 380.43
## - Medu 1 320.45 380.45
## - higher 1 320.53 380.53
## - G3 1 320.65 380.65
## - famsup 1 320.75 380.75
## - G2 1 320.85 380.85
## - Fedu 1 320.93 380.93
## - school 1 321.12 381.12
## - studytime 1 321.53 381.53
## - traveltime 1 321.81 381.81
## - failures 1 321.86 381.86
## - nursery 1 322.05 382.05
## - romantic 1 322.25 382.25
## - G1 1 322.29 382.29
## - health 1 322.39 382.39
## <none> 320.39 382.39
## - freetime 1 322.53 382.53
## - famsize 1 323.01 383.01
## - activities 1 323.26 383.26
## - guardian 2 325.33 383.33
## - reason 3 327.71 383.71
## - paid 1 324.81 384.81
## - address 1 325.57 385.57
## - sex 1 327.24 387.24
## - famrel 1 330.51 390.51
## - absences 1 332.91 392.91
## - goout 1 360.30 420.30
##
## Step: AIC=380.42
## high_use ~ G3 + absences + school + sex + address + famsize +
## Medu + Fedu + reason + guardian + traveltime + studytime +
## schoolsup + famsup + activities + nursery + higher + romantic +
## famrel + freetime + goout + health + failures + paid + G1 +
## G2
##
## Df Deviance AIC
## - schoolsup 1 320.45 378.45
## - Medu 1 320.47 378.47
## - higher 1 320.57 378.57
## - G3 1 320.67 378.67
## - famsup 1 320.77 378.77
## - G2 1 320.87 378.87
## - Fedu 1 320.95 378.95
## - school 1 321.13 379.13
## - studytime 1 321.55 379.55
## - traveltime 1 321.83 379.83
## - failures 1 321.87 379.87
## - nursery 1 322.13 380.13
## - romantic 1 322.25 380.25
## - G1 1 322.30 380.30
## <none> 320.42 380.42
## - health 1 322.42 380.42
## - freetime 1 322.59 380.59
## - famsize 1 323.06 381.06
## - activities 1 323.26 381.26
## - guardian 2 325.37 381.37
## - reason 3 327.73 381.73
## - paid 1 324.93 382.93
## - address 1 325.67 383.67
## - sex 1 327.31 385.31
## - famrel 1 330.52 388.52
## - absences 1 333.27 391.27
## - goout 1 360.46 418.46
##
## Step: AIC=378.45
## high_use ~ G3 + absences + school + sex + address + famsize +
## Medu + Fedu + reason + guardian + traveltime + studytime +
## famsup + activities + nursery + higher + romantic + famrel +
## freetime + goout + health + failures + paid + G1 + G2
##
## Df Deviance AIC
## - Medu 1 320.50 376.50
## - higher 1 320.61 376.61
## - G3 1 320.69 376.69
## - famsup 1 320.80 376.80
## - G2 1 320.89 376.89
## - Fedu 1 320.97 376.97
## - school 1 321.13 377.13
## - studytime 1 321.60 377.60
## - traveltime 1 321.84 377.84
## - failures 1 321.90 377.90
## - nursery 1 322.16 378.16
## - romantic 1 322.27 378.27
## - G1 1 322.32 378.32
## <none> 320.45 378.45
## - health 1 322.49 378.49
## - freetime 1 322.61 378.61
## - famsize 1 323.10 379.10
## - activities 1 323.35 379.35
## - guardian 2 325.38 379.38
## - reason 3 327.75 379.75
## - paid 1 325.06 381.06
## - address 1 325.79 381.79
## - sex 1 327.67 383.67
## - famrel 1 330.54 386.54
## - absences 1 333.37 389.37
## - goout 1 360.66 416.66
##
## Step: AIC=376.5
## high_use ~ G3 + absences + school + sex + address + famsize +
## Fedu + reason + guardian + traveltime + studytime + famsup +
## activities + nursery + higher + romantic + famrel + freetime +
## goout + health + failures + paid + G1 + G2
##
## Df Deviance AIC
## - higher 1 320.66 374.66
## - G3 1 320.75 374.75
## - famsup 1 320.87 374.87
## - G2 1 320.93 374.93
## - Fedu 1 321.03 375.03
## - school 1 321.18 375.18
## - studytime 1 321.65 375.65
## - traveltime 1 321.91 375.91
## - failures 1 322.01 376.01
## - nursery 1 322.28 376.28
## - romantic 1 322.35 376.35
## - G1 1 322.37 376.37
## <none> 320.50 376.50
## - health 1 322.54 376.54
## - freetime 1 322.65 376.65
## - famsize 1 323.18 377.18
## - activities 1 323.42 377.42
## - guardian 2 325.60 377.60
## - reason 3 327.81 377.81
## - paid 1 325.07 379.07
## - address 1 325.93 379.93
## - sex 1 327.67 381.67
## - famrel 1 330.59 384.59
## - absences 1 333.40 387.40
## - goout 1 360.66 414.66
##
## Step: AIC=374.66
## high_use ~ G3 + absences + school + sex + address + famsize +
## Fedu + reason + guardian + traveltime + studytime + famsup +
## activities + nursery + romantic + famrel + freetime + goout +
## health + failures + paid + G1 + G2
##
## Df Deviance AIC
## - G3 1 320.88 372.88
## - famsup 1 321.06 373.06
## - G2 1 321.10 373.10
## - Fedu 1 321.14 373.14
## - school 1 321.38 373.38
## - studytime 1 321.86 373.86
## - traveltime 1 322.07 374.07
## - failures 1 322.33 374.33
## - romantic 1 322.38 374.38
## - nursery 1 322.40 374.40
## - G1 1 322.55 374.55
## <none> 320.66 374.66
## - health 1 322.67 374.67
## - freetime 1 322.79 374.79
## - famsize 1 323.32 375.32
## - activities 1 323.79 375.79
## - guardian 2 325.86 375.86
## - reason 3 327.93 375.93
## - paid 1 325.11 377.11
## - address 1 326.17 378.17
## - sex 1 328.20 380.20
## - famrel 1 330.73 382.73
## - absences 1 333.75 385.75
## - goout 1 360.88 412.88
##
## Step: AIC=372.88
## high_use ~ absences + school + sex + address + famsize + Fedu +
## reason + guardian + traveltime + studytime + famsup + activities +
## nursery + romantic + famrel + freetime + goout + health +
## failures + paid + G1 + G2
##
## Df Deviance AIC
## - famsup 1 321.23 371.23
## - Fedu 1 321.32 371.32
## - school 1 321.66 371.66
## - studytime 1 322.19 372.19
## - traveltime 1 322.34 372.34
## - failures 1 322.42 372.42
## - G2 1 322.54 372.54
## - romantic 1 322.56 372.56
## - G1 1 322.60 372.60
## - nursery 1 322.63 372.63
## <none> 320.88 372.88
## - health 1 322.91 372.91
## - freetime 1 323.01 373.01
## - famsize 1 323.45 373.45
## - guardian 2 325.96 373.96
## - activities 1 323.98 373.98
## - reason 3 327.99 373.99
## - paid 1 325.33 375.33
## - address 1 326.35 376.35
## - sex 1 328.32 378.32
## - famrel 1 330.73 380.73
## - absences 1 334.31 384.31
## - goout 1 361.17 411.17
##
## Step: AIC=371.23
## high_use ~ absences + school + sex + address + famsize + Fedu +
## reason + guardian + traveltime + studytime + activities +
## nursery + romantic + famrel + freetime + goout + health +
## failures + paid + G1 + G2
##
## Df Deviance AIC
## - Fedu 1 321.57 369.57
## - school 1 321.87 369.87
## - traveltime 1 322.66 370.66
## - studytime 1 322.74 370.74
## - failures 1 322.78 370.78
## - G2 1 322.87 370.87
## - G1 1 322.87 370.87
## - nursery 1 322.96 370.96
## - romantic 1 322.98 370.98
## - health 1 323.12 371.12
## <none> 321.23 371.23
## - freetime 1 323.23 371.23
## - famsize 1 323.87 371.87
## - guardian 2 326.22 372.22
## - activities 1 324.31 372.31
## - reason 3 328.42 372.42
## - paid 1 325.35 373.35
## - address 1 326.60 374.60
## - sex 1 329.13 377.13
## - famrel 1 330.86 378.86
## - absences 1 334.61 382.61
## - goout 1 361.79 409.79
##
## Step: AIC=369.57
## high_use ~ absences + school + sex + address + famsize + reason +
## guardian + traveltime + studytime + activities + nursery +
## romantic + famrel + freetime + goout + health + failures +
## paid + G1 + G2
##
## Df Deviance AIC
## - school 1 322.18 368.18
## - traveltime 1 322.86 368.86
## - failures 1 322.87 368.87
## - nursery 1 323.07 369.07
## - G1 1 323.12 369.12
## - studytime 1 323.23 369.23
## - G2 1 323.23 369.23
## - romantic 1 323.28 369.28
## - freetime 1 323.44 369.44
## <none> 321.57 369.57
## - health 1 323.58 369.58
## - famsize 1 324.02 370.02
## - activities 1 324.48 370.48
## - reason 3 328.70 370.70
## - guardian 2 327.05 371.05
## - paid 1 325.82 371.82
## - address 1 326.98 372.98
## - sex 1 329.62 375.62
## - famrel 1 331.29 377.29
## - absences 1 335.46 381.46
## - goout 1 363.61 409.61
##
## Step: AIC=368.18
## high_use ~ absences + sex + address + famsize + reason + guardian +
## traveltime + studytime + activities + nursery + romantic +
## famrel + freetime + goout + health + failures + paid + G1 +
## G2
##
## Df Deviance AIC
## - traveltime 1 323.22 367.22
## - G1 1 323.54 367.54
## - nursery 1 323.57 367.57
## - failures 1 323.64 367.64
## - G2 1 323.71 367.71
## - studytime 1 323.76 367.76
## - freetime 1 323.86 367.86
## - romantic 1 324.04 368.04
## <none> 322.18 368.18
## - health 1 324.53 368.53
## - famsize 1 324.55 368.55
## - reason 3 328.79 368.79
## - activities 1 324.93 368.93
## - guardian 2 327.39 369.39
## - paid 1 326.30 370.30
## - address 1 326.99 370.99
## - sex 1 330.30 374.30
## - famrel 1 331.74 375.74
## - absences 1 336.59 380.59
## - goout 1 364.02 408.02
##
## Step: AIC=367.22
## high_use ~ absences + sex + address + famsize + reason + guardian +
## studytime + activities + nursery + romantic + famrel + freetime +
## goout + health + failures + paid + G1 + G2
##
## Df Deviance AIC
## - G1 1 324.57 366.57
## - G2 1 324.63 366.63
## - nursery 1 324.69 366.69
## - freetime 1 324.75 366.75
## - failures 1 324.81 366.81
## - studytime 1 324.99 366.99
## - romantic 1 325.10 367.10
## <none> 323.22 367.22
## - reason 3 329.53 367.53
## - health 1 325.57 367.57
## - famsize 1 325.97 367.97
## - activities 1 326.01 368.01
## - guardian 2 329.24 369.24
## - paid 1 327.40 369.40
## - address 1 330.64 372.64
## - sex 1 331.41 373.41
## - famrel 1 332.55 374.55
## - absences 1 337.62 379.62
## - goout 1 366.10 408.10
##
## Step: AIC=366.57
## high_use ~ absences + sex + address + famsize + reason + guardian +
## studytime + activities + nursery + romantic + famrel + freetime +
## goout + health + failures + paid + G2
##
## Df Deviance AIC
## - G2 1 324.70 364.70
## - freetime 1 325.93 365.93
## - nursery 1 326.11 366.11
## - failures 1 326.24 366.24
## - studytime 1 326.54 366.54
## <none> 324.57 366.57
## - health 1 327.05 367.05
## - romantic 1 327.05 367.05
## - famsize 1 327.24 367.24
## - reason 3 331.26 367.26
## - activities 1 327.40 367.40
## - guardian 2 330.30 368.30
## - paid 1 329.01 369.01
## - address 1 331.70 371.70
## - sex 1 332.66 372.66
## - famrel 1 334.08 374.08
## - absences 1 338.85 378.85
## - goout 1 366.53 406.53
##
## Step: AIC=364.7
## high_use ~ absences + sex + address + famsize + reason + guardian +
## studytime + activities + nursery + romantic + famrel + freetime +
## goout + health + failures + paid
##
## Df Deviance AIC
## - freetime 1 326.08 364.08
## - nursery 1 326.21 364.21
## - failures 1 326.24 364.24
## - studytime 1 326.57 364.57
## <none> 324.70 364.70
## - health 1 327.09 365.09
## - romantic 1 327.32 365.32
## - activities 1 327.43 365.43
## - reason 3 331.46 365.46
## - famsize 1 327.50 365.50
## - guardian 2 330.37 366.37
## - paid 1 329.18 367.18
## - address 1 331.74 369.74
## - sex 1 332.86 370.86
## - famrel 1 334.27 372.27
## - absences 1 338.85 376.85
## - goout 1 367.17 405.17
##
## Step: AIC=364.08
## high_use ~ absences + sex + address + famsize + reason + guardian +
## studytime + activities + nursery + romantic + famrel + goout +
## health + failures + paid
##
## Df Deviance AIC
## - nursery 1 327.51 363.51
## - failures 1 327.64 363.64
## - studytime 1 328.04 364.04
## <none> 326.08 364.08
## - health 1 328.44 364.44
## - romantic 1 328.46 364.46
## - activities 1 328.54 364.54
## - reason 3 332.61 364.61
## - famsize 1 328.73 364.73
## - guardian 2 332.13 366.13
## - paid 1 330.49 366.49
## - address 1 332.70 368.70
## - famrel 1 335.05 371.05
## - sex 1 335.40 371.40
## - absences 1 339.96 375.96
## - goout 1 375.82 411.82
##
## Step: AIC=363.51
## high_use ~ absences + sex + address + famsize + reason + guardian +
## studytime + activities + romantic + famrel + goout + health +
## failures + paid
##
## Df Deviance AIC
## - failures 1 329.01 363.01
## <none> 327.51 363.51
## - famsize 1 329.70 363.70
## - romantic 1 329.86 363.86
## - studytime 1 329.87 363.87
## - activities 1 329.88 363.88
## - reason 3 333.89 363.89
## - health 1 329.95 363.95
## - paid 1 331.64 365.64
## - guardian 2 334.22 366.22
## - address 1 334.73 368.73
## - famrel 1 336.36 370.36
## - sex 1 336.42 370.42
## - absences 1 340.99 374.99
## - goout 1 376.85 410.85
##
## Step: AIC=363.01
## high_use ~ absences + sex + address + famsize + reason + guardian +
## studytime + activities + romantic + famrel + goout + health +
## paid
##
## Df Deviance AIC
## <none> 329.01 363.01
## - famsize 1 331.20 363.20
## - romantic 1 331.25 363.25
## - reason 3 335.60 363.60
## - activities 1 331.61 363.61
## - health 1 331.97 363.97
## - studytime 1 332.12 364.12
## - paid 1 332.59 364.59
## - guardian 2 335.95 365.95
## - address 1 336.70 368.70
## - sex 1 338.10 370.10
## - famrel 1 338.76 370.76
## - absences 1 343.00 375.00
## - goout 1 382.34 414.34
##
## Call: glm(formula = high_use ~ absences + sex + address + famsize +
## reason + guardian + studytime + activities + romantic + famrel +
## goout + health + paid, family = "binomial", data = alc)
##
## Coefficients:
## (Intercept) absences sexM addressU
## -1.72681 0.09174 0.91034 -0.94771
## famsizeLE3 reasonhome reasonother reasonreputation
## 0.45183 0.20790 1.02210 -0.27714
## guardianmother guardianother studytime activitiesyes
## -0.75088 0.40855 -0.32528 -0.46361
## romanticyes famrel goout health
## -0.45457 -0.48800 0.90858 0.17681
## paidyes
## 0.54838
##
## Degrees of Freedom: 369 Total (i.e. Null); 353 Residual
## Null Deviance: 452
## Residual Deviance: 329 AIC: 363
m3 <- glm(high_use ~ absences + sex + address + famsize +
reason + guardian + studytime + activities + romantic + famrel +
goout + health + paid, data = alc, family = "binomial")
summary(m3)
##
## Call:
## glm(formula = high_use ~ absences + sex + address + famsize +
## reason + guardian + studytime + activities + romantic + famrel +
## goout + health + paid, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8674 -0.7171 -0.4029 0.5821 2.7665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.72681 0.94633 -1.825 0.068041 .
## absences 0.09174 0.02432 3.773 0.000161 ***
## sexM 0.91034 0.30568 2.978 0.002901 **
## addressU -0.94771 0.34320 -2.761 0.005756 **
## famsizeLE3 0.45183 0.30426 1.485 0.137536
## reasonhome 0.20790 0.36585 0.568 0.569858
## reasonother 1.02210 0.48846 2.093 0.036393 *
## reasonreputation -0.27714 0.37752 -0.734 0.462877
## guardianmother -0.75088 0.33290 -2.256 0.024096 *
## guardianother 0.40855 0.73349 0.557 0.577533
## studytime -0.32528 0.18830 -1.727 0.084081 .
## activitiesyes -0.46361 0.28927 -1.603 0.109002
## romanticyes -0.45457 0.30736 -1.479 0.139160
## famrel -0.48800 0.15813 -3.086 0.002028 **
## goout 0.90858 0.13835 6.567 5.13e-11 ***
## health 0.17681 0.10399 1.700 0.089096 .
## paidyes 0.54838 0.29252 1.875 0.060834 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 329.01 on 353 degrees of freedom
## AIC: 363.01
##
## Number of Fisher Scoring iterations: 5
alc <- mutate(alc, probability = predict(m3, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(alc$high_use,alc$probability)
## [1] 0.2756757
library(boot)  # cv.glm() comes from the boot package
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m3, K = 10)
cv$delta[1]
## [1] 0.2783784
I tried to create as good a model as possible by introducing all variables into a model and then using the step function to find the model with the smallest AIC, ending up with model m3. However, the 10-fold cross-validation error was 0.2783, only slightly smaller than for the original model. In addition, I am coding this late at night as the deadline is looming, so I must give up and declare that I cannot find a better model than the one in exercise 3.
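As a self-contained sanity check of the loss function used above (it counts the proportion of observations whose predicted probability lands on the wrong side of 0.5), here is a toy example with made-up truth labels and probabilities:

```r
# The same 0.5-threshold loss function as above, checked on toy data.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5   # TRUE when prediction and class disagree
  mean(n_wrong)                        # proportion misclassified
}

truth <- c(1, 0, 1, 0)          # observed class (1 = TRUE)
probs <- c(0.9, 0.2, 0.3, 0.7)  # predicted probabilities

loss_func(truth, probs)  # two of four predictions are on the wrong side -> 0.5
```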
date()
## [1] "Tue Dec 5 15:10:17 2023"
Let’s import the data. We will use the “Boston” dataset from the MASS package.
library(MASS)
## Warning: package 'MASS' was built under R version 4.2.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
data("Boston")
Now we can explore the structure of the data.
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Dataset “Boston” has 506 observations and 14 variables. Most of the variables are numeric, but two (chas and rad) are integers. The variables and their descriptions are presented below.
| Variable | Description |
|---|---|
| crim | per capita crime rate by town. |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft |
| indus | proportion of non-retail business acres per town. |
| chas | Charles River dummy variable |
| nox | nitrogen oxides concentration |
| rm | average number of rooms per dwelling. |
| age | proportion of owner-occupied units built prior to 1940. |
| dis | weighted mean of distances to five Boston employment centres. |
| rad | index of accessibility to radial highways. |
| tax | full-value property-tax rate per $10,000. |
| ptratio | pupil-teacher ratio by town. |
| black | proportion of blacks by town |
| lstat | lower status of the population |
| medv | median value of owner-occupied homes in $1000s. |
Summary of the data
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Here you can see the pairwise comparison of each variable. The matrix is
fairly small in size but provides some visual insights. One can see that
some variables have little relationship (such as age and ptratio)
whereas others have a strong relationship (nox and dis). A correlation
matrix could be easier to interpret.
## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded
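A minimal sketch of how the correlation matrix behind that plot can be computed (the corrplot call itself is hidden in the chunk above; here I only assume the Boston data from MASS):

```r
library(MASS)   # provides the Boston data
data(Boston)

# Round the correlation matrix to two decimals for readability
cor_matrix <- round(cor(Boston), 2)
cor_matrix["nox", "dis"]   # the strong negative nox-dis relationship noted above
```

Passing `cor_matrix` to `corrplot::corrplot()` then draws the visual version.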
## Scaling the data
boston_scaled= scale(Boston)
boston_scaled=as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
After scaling, the mean of each variable is 0 and the standard deviation is 1. So instead of absolute values, each value indicates how many standard deviations the observation lies from the mean of that variable.
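To make that concrete, `scale()` simply subtracts the column mean and divides by the column standard deviation, which can be verified by hand on a tiny vector:

```r
# scale() is equivalent to (x - mean(x)) / sd(x) for each column
x <- c(2, 4, 6, 8)
scaled_by_hand <- (x - mean(x)) / sd(x)

all.equal(as.vector(scale(x)), scaled_by_hand)  # TRUE
```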
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
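One caveat about the split above: `sample()` draws a different 80 % training set every time the document is knitted, so the confusion matrix below changes between runs. A small sketch (not part of the original code) of making the split reproducible with a fixed seed:

```r
set.seed(123)                      # any fixed seed makes the split reproducible
n <- 506                           # number of rows in boston_scaled
ind <- sample(n, size = round(n * 0.8))

length(ind)                        # 405 training rows, no duplicates
```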
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)
Here we can see which variables separate the data into different
groups. It seems that rad has the greatest effect on LD1, whereas nox
has the greatest effect on LD2. If I interpret the graph correctly,
access to radial highways is linked to a high crime rate and nitrogen
oxide concentration to medium-high rates.
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 12 2 0
## med_low 3 12 7 0
## med_high 0 5 16 1
## high 0 0 1 31
In most cases the model made correct predictions. The success rate for predicting a high crime rate was especially good, while medium-low crime rates seem to be the most difficult to predict. However, the predictions are not perfect: for example, in one case the model predicted a medium-high crime rate even though the actual rate was low.
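The overall accuracy can be read straight off the cross-tabulation: correct predictions sit on the diagonal. Using the counts printed above (they will differ on re-knit, since the train/test split is random):

```r
# Confusion matrix copied from the output above; accuracy = diagonal / total
conf <- matrix(c(12, 12,  2,  0,
                  3, 12,  7,  0,
                  0,  5, 16,  1,
                  0,  0,  1, 31),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

sum(diag(conf)) / sum(conf)  # about 0.70, i.e. roughly 70 % correct
```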
data(Boston)
scal_bos=as.data.frame(scale(Boston))
dist_eu <- dist(scal_bos)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
## K-means
Let’s calculate k-means
km <- kmeans(scal_bos, centers = 5)
summary(km)
## Length Class Mode
## cluster 506 -none- numeric
## centers 70 -none- numeric
## totss 1 -none- numeric
## withinss 5 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 5 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
However, the clustering above would be better if we found the optimal number of clusters first. So let us do that!
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(scal_bos, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The total within-cluster sum of squares drops most sharply up to two clusters, so we rerun k-means with two centers.
km <- kmeans(scal_bos, centers = 2)
pairs(scal_bos, col = km$cluster)
It is hard to see what is going on, so let’s take a closer look at some
interesting-looking variables.
pairs(scal_bos[c(1,10)], col = km$cluster)
pairs(scal_bos[c(1,7)], col = km$cluster)
For example, in these two plots one can see clear clustering with crime
and tax. High tax and high crime rate seem to form two different
clusters, which suggests that there is less crime in areas with a high
tax rate.
It also seems that the crime rate increases with the proportion of owner-occupied units built prior to 1940.
km <- kmeans(scal_bos, centers = 4)
lda.fit <- lda(km$cluster ~ ., data = scal_bos)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col = km$cluster, pch = km$cluster)
lda.arrows(lda.fit, myscale = 2.5)
The most influential variable on LD2 is black. On LD1 it is harder to determine, but it might be nox.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
model_predictors <- dplyr::select(train, -crime)
lda.fit <- lda(crime ~ ., data = train)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)
Let’s import the data!
human=read.csv("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/human.csv")
human=human[,-1]
Next we move the country names to row names and take a look at the data.
library(tibble)
## Warning: package 'tibble' was built under R version 4.2.3
library(GGally)
library(dplyr)
human=column_to_rownames(human, "Country")
ggpairs(human)
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Many of the variables are correlated with each other, e.g. life
expectancy and education. However, some variables have little
correlation with each other, such as labour force participation and
secondary education.
***
pca_human= prcomp(human)
biplot(pca_human, choices = 1:2, cex=c(1, 1), col=c("grey40", "red"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Without standardisation, the different scales make the graph impossible to read.
The maximum value of GNI is so large that it dwarfs all the other variables.
Thus we must standardise the variables.
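The effect of standardisation on PCA can be demonstrated on made-up data: when one column has a much larger variance, it captures essentially all of the first principal component. The two toy columns below are hypothetical stand-ins for a small-scale ratio and a dollar-scale variable like GNI:

```r
# Toy demo: without scaling, the high-variance column dominates PC1
set.seed(1)
small <- rnorm(100, mean = 0, sd = 1)      # small-scale variable (like a ratio)
large <- rnorm(100, mean = 0, sd = 1000)   # large-scale variable (like GNI)
toy <- data.frame(small, large)

pca_raw <- prcomp(toy)                     # unscaled: PC1 is essentially 'large'
abs(pca_raw$rotation["large", "PC1"])      # very close to 1

pca_std <- prcomp(toy, scale. = TRUE)      # scaled: both variables contribute
abs(pca_std$rotation[, "PC1"])             # similar magnitudes for both
```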
human_std= scale(human)
pca_human_std= prcomp(human_std)
biplot(pca_human_std, choices = 1:2, cex=c(1, 1), col=c("gray40", "red"))
After standardisation the graph is easier to interpret. Labo.FM has the
greatest effect on PC2. It is harder to say which variable has the
greatest effect on PC1; perhaps Edu.Exp.
There are three separate groups. Labo.FM and Parli.F create the first group:
they are correlated with each other but not with any other variable.
Mat.Mor and Ado.Birth are correlated and create group 2. Group 3 is
composed of the rest of the variables. The variables in groups 2 and 3
are negatively correlated. This means that in countries with high
education and high income, the maternal mortality ratio and adolescent
birth rate are lower. In other words, those decrease as welfare
increases. Group 1 suggests that the labour force participation rate and
the percentage of women in parliament have little to do with income and
education.
Thus PC1 represents welfare (or, since high welfare lies on the negative
side, it represents hardship), and PC2 represents variables that do not
correlate with it. PC2 could be more about equality: a high labour force
participation rate suggests all genders are actively working outside the
home, and a high percentage of women in parliament suggests women are
active participants in society.
tea = read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
View(tea)
Let’s see if there are any NA values.
unique(is.na(tea))
## breakfast tea.time evening lunch dinner always home work tearoom friends
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## resto pub Tea How sugar how where price age sex SPC Sport
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## age_Q frequency escape.exoticism spirituality healthy diuretic
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
## friendliness iron.absorption feminine sophisticated slimming exciting
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
## relaxing effect.on.health
## [1,] FALSE FALSE
No NA values.
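As a side note, the same check can be done more compactly with `anyNA()` and a per-column count; a toy sketch with a deliberately missing value (the tea data itself would return FALSE):

```r
# Compact NA checks: anyNA() for the whole frame, colSums() for per-column counts
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"))

anyNA(toy)           # TRUE: at least one missing value somewhere
colSums(is.na(toy))  # a: 1, b: 0 -- which column, and how many
```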
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.2.3
keep=c("home", "work", "tearoom", "friends", "resto", "pub")
tea_loc_sex= select(tea, one_of(keep))
mca <- MCA(tea_loc_sex, graph = FALSE)
plot(mca, invisible=c("ind"), graph.type = "classic", cex=0.8)
I chose variables that represent where tea is consumed. It seems that
people who consume tea consume it everywhere (work, restaurants,
tearooms, pubs, etc.). However, it is interesting that the “Not.home”
variable has little correlation with the other “not” answers. Dimension
1 seems to be more about “do you drink tea at all”, but Not.home is
clearly an outlier, as all the other not-answers are below zero on
dimension 1. Let’s try other plotting options!
plot(mca, invisible=c("ind"), graph.type = "ggplot", cex=0.8)
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
With the ggplot option the graph is much clearer, as the labels do not
overlap as much.
That’s all folks!
…
Let’s import data sets required for the assignment.
BPRSL=read.table("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/BPRSL.csv", header = TRUE, sep = ",")
BPRSL=BPRSL[,-1]
BPRSL$treatment=factor(BPRSL$treatment)
BPRSL$subject=factor(BPRSL$subject)
RATSL=read.table("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/RATSL.csv", header = TRUE, sep = ",")
RATSL=RATSL[,-c(1,4)]
RATSL$Group=factor(RATSL$Group)
RATSL$ID=factor(RATSL$ID)
First we fetch some packages
The data is already in long form. It is not as easy to read by eye, but it is the format R’s modelling and plotting functions prefer. However, we can have a look at the data (or part of it):
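For reference, the wide-to-long conversion that produces data like RATSL can be sketched with base R’s `reshape()` on a tiny made-up frame (two rats, two measurement days; the original wrangling script may have used a different tool, such as tidyr):

```r
# Toy wide-to-long conversion: one row per rat becomes one row per measurement
wide <- data.frame(ID = 1:2, WD1 = c(240, 225), WD8 = c(250, 230))

long <- reshape(wide, direction = "long",
                varying = c("WD1", "WD8"), v.names = "Weight",
                timevar = "Time", times = c(1, 8), idvar = "ID")

long[order(long$ID), c("ID", "Time", "Weight")]  # 4 rows: 2 rats x 2 time points
```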
knitr::kable(glimpse(RATSL))
## Rows: 176
## Columns: 4
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
| ID | Group | Weight | Time |
|---|---|---|---|
| 1 | 1 | 240 | 1 |
| 2 | 1 | 225 | 1 |
| 3 | 1 | 245 | 1 |
| 4 | 1 | 260 | 1 |
| 5 | 1 | 255 | 1 |
| 6 | 1 | 260 | 1 |
| 7 | 1 | 275 | 1 |
| 8 | 1 | 245 | 1 |
| 9 | 2 | 410 | 1 |
| 10 | 2 | 405 | 1 |
| 11 | 2 | 445 | 1 |
| 12 | 2 | 555 | 1 |
| 13 | 3 | 470 | 1 |
| 14 | 3 | 535 | 1 |
| 15 | 3 | 520 | 1 |
| 16 | 3 | 510 | 1 |
| … | … | … | … |

(rows for Time = 8–64 omitted for brevity; the full table has 176 rows)
Let’s make it easier to understand what is going on by plotting the data.
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
The weight varies among the groups, and there seems to be one noticeably larger rat in group 2.
It is easier to compare the differences if we standardize the data (subtract the time-point mean from each observation and divide by the time-point standard deviation).
RATSL = RATSL %>%
group_by(Time) %>%
mutate(stdRATSL = (Weight - mean(Weight))/sd(Weight)) %>%
ungroup()
ggplot(RATSL, aes(x = Time, y = stdRATSL, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight")
Now we can see how many standard deviations each observation deviates from
the mean at each time point.
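As a quick sanity check, the standardized weights should have mean close to 0 and standard deviation close to 1 within each time point. A minimal sketch, assuming dplyr is loaded and the `stdRATSL` column created above is available:

```r
# Within each time point the standardized weights should have
# mean ~ 0 and standard deviation ~ 1
RATSL %>%
  group_by(Time) %>%
  summarise(m = mean(stdRATSL), s = sd(stdRATSL))
```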
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = (sd(Weight)/(sqrt(length(Weight)))) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Plot the mean profiles
library(ggplot2)
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,4)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1", color=Group), width=0.5) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
The mean weight of group 1 differs clearly from groups 2 and 3. The
difference between groups 2 and 3 is much smaller, and as time increases
their standard error bars overlap.
Let’s draw box plots!
RATSBOX= RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSBOX, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight)")
There seems to be one outlier in each group. In general we should be
careful when deleting outliers, as we might as well be deleting
legitimate observations and thus introducing bias. However, as this is just
an exercise in learning to use the tools, we can discard them!
# Remove one outlier per group; filtering on conditions avoids the pitfall of
# reusing row indices computed on the unfiltered data
RATSBOX1 = RATSBOX %>%
  filter(!(Group == 1 & mean < 250),
         !(Group == 2 & mean > 500),
         !(Group == 3 & mean < 500))
ggplot(RATSBOX1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight)")
Now there are no more outliers. After removing the outliers it would be
time to perform a t-test. However, a t-test requires 2 groups and we
have 3. That is an issue! To solve this we will abandon the t-test and
move to ANOVA.
summary(aov(mean ~ Group, data = RATSBOX1))
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 183732 91866 59.87 2.72e-06 ***
## Residuals 10 15345 1534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It seems that the groups differ significantly.
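The ANOVA only tells us that at least one group mean differs from the others; a post-hoc test would show which pairs differ. A sketch, assuming the `RATSBOX1` object from the chunk above is available:

```r
# Tukey's honest significant difference test for pairwise group comparisons
TukeyHSD(aov(mean ~ Group, data = RATSBOX1))
```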
Now we can move on to modelling!
orig_rats=read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep ="\t", header = T)
RATS2 = RATSBOX %>%
mutate(baseline = orig_rats$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline+Group, data =RATS2)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The mean differs significantly from the baseline, i.e. the baseline is a strong predictor of the mean weight. Group is significant at the 10 % level but not at the 5 % level.
For this section we will use the BPRS dataset. It was already transformed to long format during the data wrangling stage and imported in the “Data” section. Let’s plot the data.
library(ggplot2)
ggplot(BPRSL, aes(x = week, y = bprs, group = subject, color=subject)) +
geom_line()
The data looks quite messy, but there it is. Let’s create a simple regression model!
BPRSL_reg <- lm(bprs ~ week + treatment, data = BPRSL)
summary(BPRSL_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Week is significant, whereas treatment has no significant effect on bprs. Note, however, that this model treats the repeated measurements on each subject as independent observations, which they are not.
Let’s continue with Random Intercept Model.
library(lme4)
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.2.3
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'lme4'
## The following object is masked from 'package:flexmix':
##
## refit
BPRS_ref = lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
The estimate for treatment 2 is slightly (about 0.6 units) higher than for treatment 1.
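`lmer` does not report p-values for the fixed effects, but approximate confidence intervals can be obtained. A sketch, assuming the `BPRS_ref` object fitted above is available:

```r
# Wald confidence intervals for the parameters of the random intercept model
confint(BPRS_ref, method = "Wald")
```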
Next it is time for Random Intercept and Random Slope Model!
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
According to the information criteria (AIC and BIC) the models are not too different. The estimate for treatment 2 is still about 0.6 units larger than for treatment 1.
We can compare the two previous models with a likelihood ratio test.
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test (chi-squared = 7.27, df = 2, p ≈ 0.026) indicates that adding the random slope for week significantly improves the fit: the random intercept and slope model has a lower deviance and a lower AIC.
Let’s add one more term, a week × treatment interaction, to take the passage of time into account.
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week * treatment | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week * treatment | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2525.8 2580.2 -1248.9 2497.8 346
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4842 -0.5448 -0.0829 0.4364 3.7005
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 178.447 13.358
## week 2.924 1.710 -0.86
## treatment2 379.364 19.477 -0.76 0.76
## week:treatment2 3.457 1.859 0.68 -0.73 -0.82
## Residual 36.748 6.062
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 44.9472 2.3816 18.872
## week -2.1809 0.2922 -7.464
## treatment2 2.9726 2.7748 1.071
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.728
## treatment2 -0.526 0.377
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + (week * treatment | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 14 2525.8 2580.2 -1248.9 2497.8 233.63 7 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the information criteria we have now created a better model (both AIC and BIC are clearly lower), and the likelihood ratio test shows the models are significantly different.
Let’s plot the fitted values!
# Extract the fitted values from the final model and print them alongside the data
# (the result of mutate() is only printed here, not stored; the plot below uses
# the Fitted vector directly)
Fitted <- fitted(BPRS_ref2)
mutate(BPRSL, Fitted)
## treatment subject weeks bprs week Fitted
## 1 1 1 week0 42 0 39.53023
## 2 1 2 week0 58 0 64.55924
## 3 1 3 week0 54 0 52.40843
## 4 1 4 week0 55 0 63.81002
## 5 1 5 week0 72 0 74.69734
## 6 1 6 week0 48 0 46.40994
## 7 1 7 week0 71 0 57.11848
## 8 1 8 week0 30 0 35.31448
## 9 1 9 week0 41 0 42.35846
## 10 1 10 week0 57 0 57.79629
## 11 1 11 week0 30 0 34.28622
## 12 1 12 week0 55 0 57.23696
## 13 1 13 week0 36 0 35.62519
## 14 1 14 week0 38 0 37.38952
## 15 1 15 week0 66 0 64.36611
## 16 1 16 week0 41 0 43.13426
## 17 1 17 week0 45 0 44.19203
## 18 1 18 week0 39 0 35.00070
## 19 1 19 week0 24 0 30.13647
## 20 1 20 week0 38 0 34.84925
## ... (rows 21-360 omitted for brevity; 360 rows in total)
library(ggplot2)
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Week", breaks = seq(0, max(BPRSL$week), 1)) +
scale_y_continuous(name = "Fitted BPRS") +
theme(legend.position = "top")
Comparing the fitted and the original values, the fitted data is a lot easier to understand. Visually it seems that a smaller BPRS is achieved with treatment 1. …
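As a rough numeric complement to the visual comparison, we could look at how closely each model fits the observed values. A sketch, assuming the three fitted model objects from the chunks above are available:

```r
# Root mean squared error of each model; smaller means a closer in-sample fit
rmse <- function(model) sqrt(mean(residuals(model)^2))
c(BPRS_ref = rmse(BPRS_ref), BPRS_ref1 = rmse(BPRS_ref1), BPRS_ref2 = rmse(BPRS_ref2))
```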